Abstract



Support vector machines (SVMs) are special kernel-based methods and have
been among the most successful learning methods for more than a decade.
SVMs can informally be described as a kind of regularized M-estimators for
functions and have demonstrated their usefulness in many complicated
real-life problems. In recent years, a large part of the statistical
research on SVMs has concentrated on the question of how to design SVMs
such that they are universally consistent and statistically robust for
nonparametric classification or nonparametric regression purposes. In many
applications, some qualitative prior knowledge of the distribution P or of
the unknown function f to be estimated is present, or a prediction function
with good interpretability is desired, such that a semiparametric model or
an additive model is of interest.

In this paper we mainly address the question of how to design SVMs by
choosing the reproducing kernel Hilbert space (RKHS) or its corresponding
kernel to obtain consistent and statistically robust estimators in additive
models. We give an explicit construction of kernels (and thus of their
RKHSs) which leads, in combination with a Lipschitz continuous loss
function, to consistent and statistically robust SVMs for additive models.
Examples are quantile regression based on the pinball loss function,
regression based on the ε-insensitive loss function, and classification
based on the hinge loss function.

KEYWORDS: Support Vector Machine, SVM, additive model, consistency,
robustness, kernel
1 Introduction



Kernel methods such as support vector machines have been among the most
successful learning methods for more than a decade; see Schölkopf and Smola
(2002). Examples are classification or regression models where we have an
input space $\mathcal{X}$, an output space $\mathcal{Y}$, some unknown
probability measure P on $\mathcal{X}\times\mathcal{Y}$, and an unknown
function $f:\mathcal{X}\to\mathbb{R}$ which describes the quantity of
interest, e.g. the conditional quantile curve, of the conditional
distribution $P(\cdot\,|\,x)$, $x\in\mathcal{X}$. Support vector machines
can informally be described as a kind of regularized M-estimators for
functions and have demonstrated their usefulness in many complicated
high-dimensional real-life problems. Besides several other nice features,
one key argument for using SVMs has been the so-called "kernel trick"
(Schölkopf et al., 1998), which decouples the SVM optimization problem from
the domain of the samples, thus making it possible to use SVMs on virtually
any input space $\mathcal{X}$. This flexibility is in strong contrast to
more classical learning methods from both machine learning and
non-parametric statistics, which almost always require input spaces
$\mathcal{X}\subset\mathbb{R}^d$. As a result, kernel methods have been
successfully used in various application areas that were previously
infeasible for machine learning methods. As examples we refer to (i) SVMs
using probability measures, e.g. histograms, as input samples, which have
been used to analyze histogram data and coloured images (Hein and Bousquet,
2005; Sriperumbudur et al., 2009), (ii) SVMs for text classification and
web mining (Joachims, 2002; Lafferty and Lebanon, 2005), and (iii) SVMs
with kernels from computational biology, e.g. kernels for trees and graphs
(Schölkopf et al., 2004).

For a data set $D_n = ((x_1,y_1),\ldots,(x_n,y_n))$, the empirical SVM is
defined as

$$ f_{L,\mathrm{D}_n,\lambda} := \arg\inf_{f\in H} \frac{1}{n}\sum_{i=1}^{n} L(x_i,y_i,f(x_i)) + \lambda\,\|f\|_H^2 . \tag{1} $$

That is, SVMs are based on three key components: (i) a convex loss function
$L:\mathcal{X}\times\mathcal{Y}\times\mathbb{R}\to[0,\infty)$ used to
measure the quality of the prediction $f(x)$, (ii) a reproducing kernel
Hilbert space (RKHS) $H$ of functions $f$ to specify the set of functions
over which the expected loss is minimized, and (iii) the regularization
term $\lambda\,\|f\|_H^2$ to reduce the danger of overfitting and to
guarantee the existence of a unique SVM even if $L$ is not strictly convex.
The RKHS is often implicitly defined by specifying a kernel
$k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. Details about the definition
of SVMs and some examples will be given in Section 2.

In recent years, a large part of the statistical research on SVMs has
concentrated on the central question of how to choose the loss function $L$,
the RKHS $H$ or its kernel $k$, and sequences of regularization parameters
$\lambda_n$ to guarantee that SVMs are universally consistent and
statistically robust for classification and regression purposes. In a
nutshell, it turned out in a purely non-parametric setup that SVMs based on
the combination of a Lipschitz continuous loss function and a bounded
continuous kernel with a dense and separable RKHS are universally
consistent with desirable statistical robustness properties for any
probability measure P from which we observed the data set; see, e.g.,
Steinwart and Christmann (2008) and Christmann et al. (2009) for details.
Examples are the combination of the Gaussian RBF kernel with the pinball
loss function for nonparametric quantile regression, with the ε-insensitive
loss function for nonparametric regression, or with the hinge loss function
for nonparametric classification; see Section 2.

Although a nonparametric approach is often the best choice in practice due
to the lack of prior knowledge on P, a semiparametric approach or an
additive model (Friedman and Stuetzle, 1981; Hastie and Tibshirani, 1990)
can also be valuable. For example, for practical reasons we may be
interested only in functions $f$ which offer a nice interpretation, because
an interpretable prediction function can be crucial if the prediction
$f(x)$ has to be explainable to clients. This can be the case if the
prediction is the expected claim amount of a client and these predictions
are the basis for the construction of an insurance tariff. Here we will
mainly consider additive models, although models with a multiplicative
structure can also be of interest. More precisely, for some
$s\in\mathbb{N}$, the input space $\mathcal{X}$ is split up into $s$
non-empty spaces according to

$$ \mathcal{X} = \mathcal{X}_1\times\cdots\times\mathcal{X}_s \tag{2} $$

and only additive functions $f:\mathcal{X}\to\mathbb{R}$ of the form

$$ f(x_1,\ldots,x_s) = f_1(x_1)+\cdots+f_s(x_s), \qquad x_j\in\mathcal{X}_j, $$

are considered, where $f_j:\mathcal{X}_j\to\mathbb{R}$ for
$j\in\{1,\ldots,s\}$.

To the best of our knowledge, there are currently no published results on
consistency and statistical robustness of SVMs based on kernels designed
for additive models. Of course, one can use one of the purely nonparametric
SVMs described above, but the hope is that SVMs based on kernels especially
designed for such situations may offer better results.



In this paper we address the question of how to design specific SVMs for
additive models. The main goal of this paper is to give an explicit
construction principle of kernels (and thus of their RKHSs) which leads, in
combination with a Lipschitz continuous loss function, to consistent and
statistically robust SVMs for additive models. Examples are SVMs in
additive models for quantile regression based on the pinball loss function,
for regression based on the ε-insensitive loss function, and for
classification based on the hinge loss function.

The rest of the paper is organized as follows. In Section 2 we collect some
known results on loss functions, kernels and their RKHSs, and on support
vector machines. These results are needed to state our results on
consistency and statistical robustness of SVMs for additive models in
Section 3. Although we have so far no result on the rates of convergence,
our numerical examples given in Section 4 will demonstrate that SVMs based
on kernels designed for additive models can easily outperform standard
nonparametric SVMs if the assumption of an additive model is valid.
Section 5 contains the discussion. All proofs are given in the Appendix.

2 Background on support vector machines 

Let $\mathcal{X}$ be a complete separable metric space and let
$\mathcal{Y}$ be a closed subset of $\mathbb{R}$. We will always use the
respective Borel-σ-algebras. The set of all probability measures on the
Borel-σ-algebra of $\mathcal{X}\times\mathcal{Y}$ is denoted by
$\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. The random input variables
$X_1,\ldots,X_n$ take their values in $\mathcal{X}$ and the random output
variables $Y_1,\ldots,Y_n$ take their values in $\mathcal{Y}$. It is
assumed that $(X_1,Y_1),\ldots,(X_n,Y_n)$ are independent and identically
distributed according to some unknown probability measure
$P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. Since
$\mathcal{Y}\subset\mathbb{R}$ is closed, P can be split into the marginal
distribution $P_{\mathcal{X}}$ on $\mathcal{X}$ and the conditional
distribution $P(\cdot\,|\,x)$ of $Y$ given $x$.

The goal is to find a good predictor $f:\mathcal{X}\to\mathbb{R}$ which
predicts the value $y$ of an output variable after observing the value $x$
of the corresponding input variable. The quality of a prediction $t=f(x)$
is measured by a loss function

$$ L:\mathcal{X}\times\mathcal{Y}\times\mathbb{R}\to[0,\infty), \qquad (x,y,t)\mapsto L(x,y,t). $$

It is assumed that $L$ is measurable and $L(x,y,y)=0$ for every
$(x,y)\in\mathcal{X}\times\mathcal{Y}$; that is, the loss is zero if the
prediction $t$ equals the actual value $y$ of the output variable. In
addition, we make the standard assumption that

$$ L(x,y,\cdot\,):\mathbb{R}\to[0,\infty), \qquad t\mapsto L(x,y,t) $$

is convex for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$ and that
additionally the following uniform Lipschitz property is fulfilled for some
real number $|L|_1\in(0,\infty)$:

$$ \sup_{(x,y)\in\mathcal{X}\times\mathcal{Y}} |L(x,y,t)-L(x,y,t')| \le |L|_1\cdot|t-t'| \qquad \forall\, t,t'\in\mathbb{R}. \tag{3} $$

We restrict our attention to Lipschitz continuous loss functions because
the use of loss functions which are not Lipschitz continuous (such as the
least squares loss, which is only locally Lipschitz continuous on unbounded
domains) usually conflicts with robustness; see, e.g., Steinwart and
Christmann (2008, §10.4).
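The quality of a (measurable) predictor $f:\mathcal{X}\to\mathbb{R}$ is
measured by the risk

$$ \mathcal{R}_{L,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L(x,y,f(x))\,P(d(x,y)). $$

By different choices of $\mathcal{Y}$ and the loss function $L$, different
purposes are covered by this setup, e.g. binary classification for
$\mathcal{Y}=\{-1,+1\}$ and the hinge loss

$$ L_{\mathrm{hinge}}(x,y,t) := \max\{0,\,1-yt\}, $$

regression for $\mathcal{Y}=\mathbb{R}$ and the ε-insensitive loss

$$ L_{\varepsilon}(x,y,t) := \max\{0,\,|y-t|-\varepsilon\} $$

where $\varepsilon>0$, and quantile regression for $\mathcal{Y}=\mathbb{R}$
and the pinball loss

$$ L_{\tau}(x,y,t) := \begin{cases} (\tau-1)(y-t), & \text{if } y-t<0,\\ \tau\,(y-t), & \text{if } y-t\ge 0, \end{cases} \tag{4} $$

where $\tau\in(0,1)$.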
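The three loss functions above can be written down directly. The following
is a minimal sketch in plain Python; the function names and default
parameter values are our own illustrative choices and are not taken from
any SVM library. Since none of these losses depends on $x$, the argument
$x$ is dropped.

```python
def hinge_loss(y, t):
    """Hinge loss max{0, 1 - y*t} for labels y in {-1, +1}."""
    return max(0.0, 1.0 - y * t)


def eps_insensitive_loss(y, t, eps=0.1):
    """Epsilon-insensitive loss max{0, |y - t| - eps}; eps=0.1 is illustrative."""
    return max(0.0, abs(y - t) - eps)


def pinball_loss(y, t, tau=0.5):
    """Pinball loss with quantile level tau in (0, 1)."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


# A correct prediction t = y costs nothing for all three losses.
print(hinge_loss(1, 1), eps_insensitive_loss(2.0, 2.0), pinball_loss(3.0, 3.0))
```

All three functions are convex in $t$ and Lipschitz continuous in $t$, with
constants 1, 1 and $\max\{\tau,1-\tau\}$, respectively.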

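An optimal predictor is a measurable function $f^*:\mathcal{X}\to\mathbb{R}$
which attains the minimal risk, called Bayes-risk,

$$ \mathcal{R}^*_{L,P} = \inf_{f:\mathcal{X}\to\mathbb{R}\ \text{measurable}} \mathcal{R}_{L,P}(f). $$

The optimal predictor in a set $\mathcal{F}$ of measurable functions
$f:\mathcal{X}\to\mathbb{R}$ is an $f^*\in\mathcal{F}$ which attains the
minimal risk

$$ \mathcal{R}^*_{L,P,\mathcal{F}} = \inf_{f\in\mathcal{F}} \mathcal{R}_{L,P}(f). $$

For example, the goal of quantile regression is to estimate a conditional
quantile function, i.e., a function
$f^*_{\tau,P}:\mathcal{X}\to\mathbb{R}$ such that

$$ P\big((-\infty,f^*_{\tau,P}(x)]\,\big|\,x\big)\ge\tau \quad\text{and}\quad P\big([f^*_{\tau,P}(x),\infty)\,\big|\,x\big)\ge 1-\tau $$

for the quantile level $\tau\in(0,1)$. If $f^*_{\tau,P}\in\mathcal{F}$,
then the conditional quantile function $f^*_{\tau,P}$ attains the minimal
risk $\mathcal{R}^*_{L_\tau,P,\mathcal{F}}$ for the pinball loss $L_\tau$
(with parameter $\tau$), so that quantile regression can be done by trying
to minimize the risk $\mathcal{R}_{L_\tau,P}$ in $\mathcal{F}$.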
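The link between the pinball loss and quantiles can be checked numerically
in the simplest case of constant predictors. The sketch below is only an
illustration under our own assumptions: the sample is drawn from a standard
normal distribution with a fixed seed, and the sample size 501 and level
0.25 are arbitrary choices. Because the empirical pinball risk is convex
and piecewise linear in $t$, a minimizer can be found among the sample
points.

```python
import random


def pinball_loss(y, t, tau):
    """Pinball loss with quantile level tau in (0, 1)."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


def empirical_pinball_risk(t, ys, tau):
    """Average pinball loss of the constant prediction t on the sample ys."""
    return sum(pinball_loss(y, t, tau) for y in ys) / len(ys)


random.seed(0)                      # illustrative sample, not from the paper
ys = [random.gauss(0.0, 1.0) for _ in range(501)]
tau = 0.25

# Search the minimizer among the sample points (sufficient by convexity).
best_t = min(ys, key=lambda t: empirical_pinball_risk(t, ys, tau))

# best_t satisfies the two defining inequalities of a tau-quantile
# with respect to the empirical distribution of the sample.
frac_le = sum(y <= best_t for y in ys) / len(ys)
frac_ge = sum(y >= best_t for y in ys) / len(ys)
print(best_t, frac_le >= tau, frac_ge >= 1 - tau)
```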

One way to build a non-parametric predictor $f$ is to use a support vector
machine

$$ f_{L,P,\lambda} := \arg\inf_{f\in H} \mathcal{R}_{L,P}(f) + \lambda\,\|f\|_H^2 , \tag{5} $$

where $H$ is a reproducing kernel Hilbert space (RKHS) of a measurable
kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, and $\lambda>0$ is a
regularization parameter to reduce the danger of overfitting; see, e.g.,
Vapnik (1998), Schölkopf and Smola (2002) or Steinwart and Christmann
(2008) for details. The reproducing property of $k$ states that, for all
$f\in H$ and all $x\in\mathcal{X}$,

$$ f(x) = \langle f,\Phi(x)\rangle_H $$

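where $\Phi:\mathcal{X}\to H$, $x\mapsto k(\cdot,x)$ denotes the canonical
feature map. A kernel $k$ is called bounded if

$$ \|k\|_\infty := \sup_{x\in\mathcal{X}} \sqrt{k(x,x)} < \infty . $$

Using the reproducing property and $\|\Phi(x)\|_H=\sqrt{k(x,x)}$, we obtain
the well-known inequalities

$$ \|f\|_\infty \le \|k\|_\infty\,\|f\|_H \tag{6} $$

and

$$ \|\Phi(x)\|_\infty \le \|k\|_\infty\,\|\Phi(x)\|_H \le \|k\|_\infty^2 \tag{7} $$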

for all $f\in H$ and all $x\in\mathcal{X}$. As an example of a bounded
kernel, we mention the popular Gaussian radial basis function (GRBF) kernel
defined by

$$ k_\gamma(x,x') = \exp\big(-\gamma^{-2}\,\|x-x'\|^2_{\mathbb{R}^d}\big), \qquad x,x'\in\mathcal{X}, \tag{8} $$

where $\gamma$ is some positive constant and
$\mathcal{X}\subset\mathbb{R}^d$. This kernel leads to a large RKHS which
is dense in $L_1(\mu)$ for all probability measures $\mu$ on
$\mathbb{R}^d$. We will also consider the polynomial kernel

$$ k_{m,c}(x,x') = \big(\langle x,x'\rangle_{\mathbb{R}^d} + c\big)^m, \qquad x,x'\in\mathcal{X}, $$

where $m\in(0,\infty)$, $c\in(0,\infty)$ and
$\mathcal{X}\subset\mathbb{R}^d$. The dot kernel is a special polynomial
kernel with $c=0$ and $m=1$. The polynomial kernel is bounded if and only
if $\mathcal{X}$ is bounded.
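The two kernels and the boundedness statement can be illustrated with a
short sketch. It assumes the GRBF parametrization
$\exp(-\gamma^{-2}\|x-x'\|^2)$ and an integer degree $m$; the function
names and the sample points are our own illustrative choices.

```python
import math


def grbf_kernel(x, xp, gamma=1.0):
    """Gaussian RBF kernel exp(-||x - xp||^2 / gamma^2) on tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / gamma ** 2)


def poly_kernel(x, xp, m=2, c=1.0):
    """Polynomial kernel (<x, xp> + c)^m; c=0, m=1 gives the dot kernel."""
    return (sum(a * b for a, b in zip(x, xp)) + c) ** m


# k(x, x) = 1 for every x, so the GRBF kernel is bounded with ||k||_inf = 1.
print([grbf_kernel((u,), (u,)) for u in (0.0, 10.0, 1000.0)])

# sqrt(k(x, x)) grows without bound for the polynomial kernel on R,
# so it is bounded only if the input space is bounded.
print([math.sqrt(poly_kernel((u,), (u,))) for u in (0.0, 10.0, 1000.0)])
```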
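Of course, the regularized risk

$$ \mathcal{R}^{\mathrm{reg}}_{L,P,\lambda}(f) := \mathcal{R}_{L,P}(f) + \lambda\,\|f\|_H^2 $$

is in general not computable, because P is unknown. However, the empirical
distribution

$$ \mathrm{D}_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{(x_i,y_i)} $$

corresponding to the data set $D_n=((x_1,y_1),\ldots,(x_n,y_n))$ can be
used as an estimator of P. Here $\delta_{(x_i,y_i)}$ denotes the Dirac
distribution in $(x_i,y_i)$. If we replace P by $\mathrm{D}_n$ in (5), we
obtain the regularized empirical risk
$\mathcal{R}^{\mathrm{reg}}_{L,\mathrm{D}_n,\lambda}(f)$ and the empirical
SVM $f_{L,\mathrm{D}_n,\lambda}$. Furthermore, we need analogous notions
where $(x_i,y_i)$ is replaced by random variables $(X_i,Y_i)$. Thus, we
define

$$ \mathbf{D}_n = \frac{1}{n}\sum_{i=1}^{n}\delta_{(X_i,Y_i)} . $$

Then, for every $\omega\in\Omega$, $\mathbf{D}_n(\omega)$ is the empirical
distribution corresponding to the data set
$((X_1(\omega),Y_1(\omega)),\ldots,(X_n(\omega),Y_n(\omega)))$ and,
accordingly, $\mathcal{R}^{\mathrm{reg}}_{L,\mathbf{D}_n,\lambda}(f)$
denotes the mapping $\Omega\to\mathbb{R}$,
$\omega\mapsto\mathcal{R}^{\mathrm{reg}}_{L,\mathbf{D}_n(\omega),\lambda}(f)$,
and $f_{L,\mathbf{D}_n,\lambda}$ denotes the mapping $\Omega\to H$,
$\omega\mapsto f_{L,\mathbf{D}_n(\omega),\lambda}$.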
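To make the regularized empirical risk concrete, the following sketch
evaluates it for a function of the form $f=\sum_j\alpha_j k(\cdot,x_j)$,
for which the reproducing property gives $\|f\|_H^2=\alpha^\top K\alpha$
with Gram matrix $K=(k(x_i,x_j))_{i,j}$. The restriction to such finite
kernel expansions, the one-dimensional GRBF kernel, the pinball loss and
all names are our own illustrative assumptions.

```python
import math


def grbf(x, xp, gamma=1.0):
    """One-dimensional Gaussian RBF kernel."""
    return math.exp(-((x - xp) ** 2) / gamma ** 2)


def pinball(y, t, tau=0.5):
    """Pinball loss with quantile level tau."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


def regularized_empirical_risk(alpha, xs, ys, lam, kernel=grbf, loss=pinball):
    """(1/n) sum_i L(y_i, f(x_i)) + lam * ||f||_H^2 for f = sum_j alpha_j k(., x_j)."""
    n = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    f_vals = [sum(alpha[j] * gram[i][j] for j in range(n)) for i in range(n)]
    emp_risk = sum(loss(y, t) for y, t in zip(ys, f_vals)) / n
    sq_norm = sum(alpha[i] * gram[i][j] * alpha[j]
                  for i in range(n) for j in range(n))
    return emp_risk + lam * sq_norm


xs, ys = [0.0, 1.0], [1.0, -1.0]
print(regularized_empirical_risk([0.0, 0.0], xs, ys, lam=0.1))
print(regularized_empirical_risk([1.0, 0.0], xs, ys, lam=0.1))
```

Minimizing this quantity over $\alpha$ is a finite-dimensional convex
problem; the sketch only evaluates the objective and does not solve it.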

Support vector machines $f_{L,P,\lambda}$ need not exist for every
probability measure $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$; for
Lipschitz continuous loss functions it is sufficient for the existence of
$f_{L,P,\lambda}$ that $\int L(x,y,0)\,P(d(x,y))<\infty$. This condition
may be violated by heavy-tailed distributions P and, in this case, it is
possible that $\mathcal{R}_{L,P}(f)=\infty$ for every $f\in H$.

In order to enlarge the applicability of support vector machines to
heavy-tailed distributions, the following extension has been developed in
Christmann et al. (2009). Following an idea already used by Huber (1967)
for M-estimates in parametric models, a shifted loss function
$L^*:\mathcal{X}\times\mathcal{Y}\times\mathbb{R}\to\mathbb{R}$ is defined
by

$$ L^*(x,y,t) = L(x,y,t) - L(x,y,0) \qquad \forall\,(x,y,t)\in\mathcal{X}\times\mathcal{Y}\times\mathbb{R}. $$

Then, similar to the original loss function $L$, define the $L^*$-risk by

$$ \mathcal{R}_{L^*,P}(f) = \int L^*(x,y,f(x))\,P(d(x,y)) $$

and the regularized $L^*$-risk by

$$ \mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}(f) = \mathcal{R}_{L^*,P}(f) + \lambda\,\|f\|_H^2 $$

for every $f\in H$. In complete analogy to (5), we define the support
vector machine based on the shifted loss function $L^*$ by

$$ f_{L,P,\lambda} := \arg\inf_{f\in H} \mathcal{R}_{L^*,P}(f) + \lambda\,\|f\|_H^2 . \tag{9} $$
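The effect of the shift can be seen in a small sketch. By the Lipschitz
property, $|L^*(x,y,t)|\le|L|_1\,|t|$ regardless of how large $y$ is, which
is why the $L^*$-risk can be finite even when the original risk is not. The
helper `shifted` and the numerical values below are our own illustrative
choices; the pinball loss is used as the example.

```python
def pinball(y, t, tau=0.5):
    """Pinball loss with quantile level tau."""
    r = y - t
    return tau * r if r >= 0 else (tau - 1.0) * r


def shifted(loss):
    """Return the shifted loss L*(y, t) = L(y, t) - L(y, 0)."""
    def loss_star(y, t, **kwargs):
        return loss(y, t, **kwargs) - loss(y, 0.0, **kwargs)
    return loss_star


pinball_star = shifted(pinball)

y_extreme, t = 1e12, 3.0
print(pinball(y_extreme, t))        # huge: grows linearly with |y|
print(pinball_star(y_extreme, t))   # small: bounded by max(tau, 1 - tau) * |t|
```

Note that the shifted loss may take negative values, as in this example.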

If the support vector machine $f_{L,P,\lambda}$ defined by (5) exists, we
have seemingly defined $f_{L,P,\lambda}$ in two different ways now.
However, the two definitions coincide in this case, and the following
theorem summarizes some basic results of Christmann et al. (2009).

Theorem 1. Let $L$ be a convex and Lipschitz continuous loss function and
let $k$ be a bounded kernel. Then, for every
$P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ and every
$\lambda\in(0,\infty)$, there exists a unique SVM $f_{L,P,\lambda}\in H$
which minimizes $\mathcal{R}^{\mathrm{reg}}_{L^*,P,\lambda}$, i.e.

$$ \mathcal{R}_{L^*,P}(f_{L,P,\lambda}) + \lambda\,\|f_{L,P,\lambda}\|_H^2 = \inf_{f\in H} \mathcal{R}_{L^*,P}(f) + \lambda\,\|f\|_H^2 . $$

If the support vector machine $f_{L,P,\lambda}$ defined by (5) exists, then
the two definitions (5) and (9) coincide.



3 Support vector machines for additive models

3.1 Model and assumptions 

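As described in the previous section, the goal is to minimize the risk
$f\mapsto\mathcal{R}_{L,P}(f)$ in a set $\mathcal{F}$ of functions
$f:\mathcal{X}\to\mathbb{R}$. In this article, we assume an additive model.
Accordingly, let

$$ \mathcal{X} = \mathcal{X}_1\times\cdots\times\mathcal{X}_s $$

where $\mathcal{X}_1,\ldots,\mathcal{X}_s$ are non-empty sets. For every
$j\in\{1,\ldots,s\}$, let $\mathcal{F}_j$ be a set of functions
$f_j:\mathcal{X}_j\to\mathbb{R}$. Then, we only consider functions
$f:\mathcal{X}\to\mathbb{R}$ of the form

$$ f(x_1,\ldots,x_s) = f_1(x_1)+\cdots+f_s(x_s) \qquad \forall\,(x_1,\ldots,x_s)\in\mathcal{X}_1\times\cdots\times\mathcal{X}_s $$

for $f_1\in\mathcal{F}_1,\ldots,f_s\in\mathcal{F}_s$. Thus,

$$ \mathcal{F} := \{\,f_1+\cdots+f_s \;:\; f_j\in\mathcal{F}_j,\ 1\le j\le s\,\}. \tag{10} $$

In (10), we have identified $f_j$ with the map $\mathcal{X}\to\mathbb{R}$,
$(x_1,\ldots,x_s)\mapsto f_j(x_j)$.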

Such additive models can be treated by support vector machines in a very
natural way. For every $j\in\{1,\ldots,s\}$, choose a kernel $k_j$ on
$\mathcal{X}_j$ with RKHS $H_j$. Then, the space of functions

$$ H := \{\,f_1+\cdots+f_s \;:\; f_j\in H_j,\ 1\le j\le s\,\} $$

is an RKHS on $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_s$
with kernel $k=k_1+\cdots+k_s$; see Theorem 2 below. In this way, SVMs can
be used to fit additive models and SVMs enjoy at least three appealing
features: First, it is guaranteed that the predictor has the assumed
additive structure $(x_1,\ldots,x_s)\mapsto f_1(x_1)+\cdots+f_s(x_s)$.
Second, it is still possible to use the standard SVM machinery including
the kernel trick (Schölkopf and Smola, 2002, §2) and implementations of
SVMs, just by selecting a kernel $k=k_1+\cdots+k_s$. Third, the possibility
to choose different kernels $k_1,\ldots,k_s$ offers great flexibility. For
example, take $s=2$ and let $k_1$ be a GRBF kernel on $\mathbb{R}^{d_1}$
and $k_2$ be a GRBF kernel on $\mathbb{R}^{d_2}$. Since the RKHS of a
Gaussian kernel is an infinite-dimensional function space, we get
non-parametric estimates of $f_1$ and $f_2$. As a second example, consider
a semiparametric model with $f=f_1+f_2$ where $f_1:x_1\mapsto f_1(x_1)$ is
assumed to be a polynomial function of order at most $m$ and
$f_2:x_2\mapsto f_2(x_2)$ may be some complicated function. Then, this
semiparametric model can be treated by simply taking a polynomial kernel on
$\mathbb{R}^{d_1}$ for $k_1$ and a GRBF kernel on $\mathbb{R}^{d_2}$ for
$k_2$. This can be used, for example, in order to model changes in space
(for $d_1\le 3$ and $x_1$ specifying the location) or in time (for $d_1=1$
and $x_1$ specifying the point in time).
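The construction $k=k_1+\cdots+k_s$ is easy to express in code. The
following sketch builds the semiparametric example with a polynomial kernel
on the first coordinate and a GRBF kernel on the second, both
one-dimensional for simplicity. The helper names, the kernel parameters and
the random test points are our own illustrative assumptions. As a sanity
check, the quadratic form $\alpha^\top K\alpha$ of the resulting Gram
matrix is evaluated, which must be non-negative for a kernel.

```python
import math
import random


def grbf_1d(u, v, gamma=1.0):
    """One-dimensional Gaussian RBF kernel."""
    return math.exp(-((u - v) ** 2) / gamma ** 2)


def poly_1d(u, v, m=2, c=1.0):
    """One-dimensional polynomial kernel (u*v + c)^m."""
    return (u * v + c) ** m


def additive_kernel(component_kernels):
    """Return k(x, x') = sum_j k_j(x_j, x'_j) for inputs x = (x_1, ..., x_s)."""
    def k(x, xp):
        return sum(kj(xj, xpj) for kj, xj, xpj in zip(component_kernels, x, xp))
    return k


def quadratic_form(alpha, points, k):
    """alpha^T K alpha for the Gram matrix K of k on the given points."""
    return sum(a * k(p, q) * b
               for a, p in zip(alpha, points) for b, q in zip(alpha, points))


k = additive_kernel([poly_1d, grbf_1d])   # k_1 polynomial, k_2 GRBF

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(8)]
alpha = [rng.uniform(-1.0, 1.0) for _ in range(8)]
print(k((1.0, 0.0), (2.0, 0.0)), quadratic_form(alpha, points, k))
```

Any SVM implementation that accepts a user-supplied kernel or a precomputed
Gram matrix could, in principle, be used with such a kernel.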

Theorem 2. For every $j\in\{1,\ldots,s\}$, let $\mathcal{X}_j$ be a
non-empty set and

$$ k_j:\mathcal{X}_j\times\mathcal{X}_j\to\mathbb{R} $$

be a kernel with corresponding RKHS $H_j$. Define $k=k_1+\cdots+k_s$. That
is,

$$ k\big((x_1,\ldots,x_s),(x_1',\ldots,x_s')\big) = k_1(x_1,x_1')+\cdots+k_s(x_s,x_s') $$

for every $x_j,x_j'\in\mathcal{X}_j$, $j\in\{1,\ldots,s\}$. Then, $k$ is a
kernel on $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_s$ with
RKHS

$$ H := \{\,f_1+\cdots+f_s \;:\; f_j\in H_j,\ 1\le j\le s\,\} $$

and the norm of $H$ fulfills

$$ \|f_1+\cdots+f_s\|_H^2 \le \|f_1\|_{H_1}^2+\cdots+\|f_s\|_{H_s}^2 \qquad \forall\, f_1\in H_1,\ldots,f_s\in H_s. \tag{11} $$

If not otherwise stated, we make the following assumptions throughout 
the rest of the paper although some of the results are also valid under more 
general conditions. 

Main assumptions 

(i) For every $j\in\{1,\ldots,s\}$, the set $\mathcal{X}_j$ is a complete,
separable metric space; $k_j$ is a continuous and bounded kernel on
$\mathcal{X}_j$ with RKHS $H_j$. Furthermore, $k=k_1+\cdots+k_s$ denotes
the kernel on $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_s$
defined in Theorem 2 and $H$ denotes its RKHS.

(ii) The subset $\mathcal{Y}\subset\mathbb{R}$ is closed.

(iii) The loss function $L$ is convex and fulfills the uniform Lipschitz
continuity (3) with Lipschitz constant $|L|_1\in(0,\infty)$. In addition,
$L(x,y,y)=0$ for every $(x,y)\in\mathcal{X}\times\mathcal{Y}$.

Note that every closed subset of $\mathbb{R}^d$ is a complete, separable
metric space. We restrict ourselves to Lipschitz continuous loss functions
and continuous and bounded kernels because it has been shown earlier that
these assumptions are necessary in order to ensure good robustness
properties; see e.g. Steinwart and Christmann (2008, §10). The condition
$L(x,y,y)=0$ is quite natural and practically always fulfilled: it means
that the loss of a correct prediction is 0. Our assumptions cover many of
the most interesting cases. In particular, the hinge loss (classification),
the ε-insensitive loss (regression) and the pinball loss (quantile
regression) fulfill all assumptions. Many commonly used kernels are
continuous. In addition, the Gaussian kernel is always bounded, whereas the
linear kernel and all polynomial kernels are bounded if and only if
$\mathcal{X}_j$ is bounded. From the assumption that the kernels $k_j$ are
continuous and bounded on $\mathcal{X}_j$, it follows that the kernel
$k=k_1+\cdots+k_s$ is continuous and bounded on $\mathcal{X}$.



3.2 Consistency 

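SVMs are called universally consistent if the risk of the SVM estimator
$f_{L,\mathbf{D}_n,\lambda_n}$ converges, for all probability measures P,
in probability to the Bayes-risk, i.e.

$$ \mathcal{R}_{L^*,P}(f_{L,\mathbf{D}_n,\lambda_n}) \longrightarrow \mathcal{R}^*_{L^*,P} \qquad (n\to\infty). \tag{12} $$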

In order to obtain universal consistency of SVMs, it is necessary to choose
a kernel with a large RKHS. Accordingly, most known results about universal
consistency of SVMs assume that the RKHS is dense in $C(\mathcal{X})$ where
$\mathcal{X}$ is a compact metric space (see e.g. Steinwart (2001)) or, at
least, that the RKHS is dense in $L_q(P_{\mathcal{X}})$ for some
$q\in[1,\infty)$. In this paper, we consider an additive model where the
goal is to minimize the risk $f\mapsto\mathcal{R}_{L,P}(f)$ in the set

$$ \mathcal{F} = \{\,f_1+\cdots+f_s \;:\; f_j\in\mathcal{F}_j,\ 1\le j\le s\,\}. $$

For the consistency of SVMs in an additive model, we do not need the RKHS
$H=H_1+\cdots+H_s$ to be dense in the whole space $L_q(P_{\mathcal{X}})$;
instead, we only assume that each $H_j$ is dense in $\mathcal{F}_j$. As
usual, $\mathcal{L}_q(\mu)$ denotes the set of all $q$-integrable
real-valued functions with respect to some measure $\mu$ and $L_q(\mu)$
denotes the set of all equivalence classes in $\mathcal{L}_q(\mu)$.
Theorem 3 shows consistency of SVMs in additive models. That is, the
$L^*$-risk of $f_{L,\mathbf{D}_n,\lambda_n}$ converges in probability to
the smallest possible risk in $\mathcal{F}$.

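Theorem 3. Let the main assumptions be valid. Let
$P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ such that

$$ H_j \subset \mathcal{F}_j \subset \mathcal{L}_1(P_{\mathcal{X}_j}), \qquad 1\le j\le s, $$

and let $H_j$ be dense in $\mathcal{F}_j$ with respect to
$\|\cdot\|_{L_1(P_{\mathcal{X}_j})}$. Then, for every sequence
$(\lambda_n)_{n\in\mathbb{N}}\subset(0,\infty)$ such that
$\lim_{n\to\infty}\lambda_n=0$ and $\lim_{n\to\infty}\lambda_n^2 n=\infty$,

$$ \mathcal{R}_{L^*,P}(f_{L,\mathbf{D}_n,\lambda_n}) \longrightarrow \mathcal{R}^*_{L^*,P,\mathcal{F}} \qquad (n\to\infty) $$

in probability.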
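As a worked illustration of the conditions on the regularization sequence,
consider the polynomially decaying family $\lambda_n=c\,n^{-a}$. This
family and the constants $c$ and $a$ are our own hypothetical choices, and
we read the second condition as $\lambda_n^2 n\to\infty$. A short
calculation shows which exponents are admissible:

```latex
\lambda_n = c\,n^{-a}, \quad c>0,\ a>0
\quad\Longrightarrow\quad
\lambda_n \to 0
\quad\text{and}\quad
\lambda_n^2\,n = c^2\,n^{1-2a} \to \infty
\iff a < \tfrac{1}{2}.
```

Hence every sequence of this form with $0<a<1/2$ satisfies both conditions,
while the constant $c$ plays no role asymptotically.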

In general, it is not clear whether convergence of the risks implies
convergence of the SVM $f_{L,\mathbf{D}_n,\lambda_n}$. However, the
following theorem will show such a convergence for quantile regression in
an additive model, under the condition that the quantile function
$f^*_{\tau,P}$ actually lies in
$\mathcal{F}=\mathcal{F}_1+\cdots+\mathcal{F}_s$. In order to formulate
this result, we define

$$ d_0(f,g) = \int \min\{1,\,|f-g|\}\,dP_{\mathcal{X}} , $$



where $f,g:\mathcal{X}\to\mathbb{R}$ are arbitrary measurable functions. It
is known that $d_0$ is a metric describing convergence in probability.
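In practice $d_0$ can be approximated by averaging over a sample of inputs.
The sketch below is a simple Monte Carlo style estimate under the
assumption that the points `xs` are drawn from the marginal distribution of
the inputs; the function name and the toy functions are hypothetical.

```python
def d0_empirical(f, g, xs):
    """Sample average of min{1, |f(x) - g(x)|} over the input points xs."""
    return sum(min(1.0, abs(f(x) - g(x))) for x in xs) / len(xs)


xs = [0.0, 1.0, 2.0]
print(d0_empirical(lambda x: x, lambda x: x + 0.25, xs))  # small difference
print(d0_empirical(lambda x: x, lambda x: x + 5.0, xs))   # capped at 1
```

The truncation at 1 means that large deviations on a small set of inputs
contribute only in proportion to the probability of that set.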

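Theorem 4. Let the main assumptions be valid. Let
$P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ such that

$$ H_j \subset \mathcal{F}_j \subset \mathcal{L}_1(P_{\mathcal{X}_j}) \qquad \forall\, j\in\{1,\ldots,s\} $$

and $H_j$ is dense in $\mathcal{F}_j$ with respect to
$\|\cdot\|_{L_1(P_{\mathcal{X}_j})}$. Let $\tau\in(0,1)$ and assume that
the quantile function $f^*_{\tau,P}$ is $P_{\mathcal{X}}$-almost surely
unique and that

$$ f^*_{\tau,P}\in\mathcal{F}. $$

Then, for the pinball loss function $L=L_\tau$ and for every sequence
$(\lambda_n)_{n\in\mathbb{N}}\subset(0,\infty)$ such that
$\lim_{n\to\infty}\lambda_n=0$ and $\lim_{n\to\infty}\lambda_n^2 n=\infty$,

$$ d_0\big(f_{L,\mathbf{D}_n,\lambda_n},\,f^*_{\tau,P}\big) \longrightarrow 0 \qquad (n\to\infty) $$

in probability.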



3.3 Robustness 

In recent years some general results on the statistical robustness
properties of SVMs have been shown. Many of these results are directly
applicable to SVMs for additive models if the kernel is bounded and
continuous (or at least measurable) and the loss function is Lipschitz
continuous. For brevity we only give upper bounds for the bias and the
Bouligand influence function for SVMs, which are both even applicable for
non-smooth loss functions like the pinball loss for quantile regression,
and refer to Christmann et al. (2009) and Steinwart and Christmann (2008,
Chap. 10) for results on the classical influence function proposed by
Hampel (1968, 1974) and to Hable and Christmann (2009) for qualitative
robustness of SVMs. Define the function

$$ T:\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})\to H, \qquad T(P) := f_{L,P,\lambda}, \tag{13} $$

which maps each probability distribution to its SVM. In robust statistics
we are interested in smooth and bounded functions $T$, because this will
give us stable SVMs within small neighborhoods of P. If an appropriately
chosen derivative of $T(P)$ is bounded, then we expect the value of $T(Q)$
to be close to the value of $T(P)$ for distributions Q in a small
neighborhood of P.






The next result shows that the $H$-norm of the difference of two SVMs
increases with respect to the mixture proportion $\varepsilon\in(0,1)$ at
most linearly in gross-error neighborhoods. The norm of total variation of
a signed measure $\mu$ is denoted by $\|\mu\|_{\mathcal{M}}$.

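Theorem 5 (Bounds for bias). If the main assumptions are valid, then we
have, for all $\lambda>0$, all $\varepsilon\in[0,1]$, and all probability
measures P and Q on $\mathcal{X}\times\mathcal{Y}$, that

$$ \|T(Q)-T(P)\|_\infty \le c\,\|P-Q\|_{\mathcal{M}} , \tag{14} $$

$$ \|T((1-\varepsilon)P+\varepsilon Q)-T(P)\|_\infty \le c\,\|P-Q\|_{\mathcal{M}}\;\varepsilon , \tag{15} $$

where $c := \frac{1}{\lambda}\,\|k\|_\infty^2\,|L|_1$.

Because of (7), there are analogous bias bounds of SVMs with respect to the
norm in $H$, if we replace $c$ by
$\tilde{c} := \frac{1}{\lambda}\,\|k\|_\infty\,|L|_1$.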

While F.R. Hampel's influence function is related to a Gâteaux derivative,
which is linear, the Bouligand influence function is related to the
Bouligand derivative, which needs only to be positive homogeneous. Because
this weak derivative is less known in statistics, we would like to recall
its definition. Let $E_1$ and $E_2$ be normed linear spaces. A function
$f:E_1\to E_2$ is called positive homogeneous if
$f(\alpha x)=\alpha f(x)$ for all $\alpha\ge 0$ and for all $x\in E_1$. If
$U$ is an open subset of $E_1$, then a function $f:U\to E_2$ is called
Bouligand-differentiable at a point $x_0\in U$ if there exists a positive
homogeneous function $\nabla^B f(x_0):U\to E_2$ such that

$$ \lim_{h\to 0} \frac{\|f(x_0+h)-f(x_0)-\nabla^B f(x_0)(h)\|_{E_2}}{\|h\|_{E_1}} = 0; $$

see Robinson (1991).
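A standard one-dimensional example may help: the absolute value function
$f(x)=|x|$ has no linear derivative at $x_0=0$, but it is
Bouligand-differentiable there with $\nabla^B f(0)(h)=|h|$, which is
positive homogeneous without being linear. The sketch below checks these
properties numerically; the function names are our own.

```python
def f(x):
    """Absolute value, not (Gateaux/Frechet) differentiable at 0."""
    return abs(x)


def bouligand_derivative_at_zero(h):
    """Candidate Bouligand derivative of f at x0 = 0: h -> |h|."""
    return abs(h)


def remainder_ratio(h, x0=0.0):
    """|f(x0 + h) - f(x0) - D(h)| / |h| for the candidate derivative D."""
    return abs(f(x0 + h) - f(x0) - bouligand_derivative_at_zero(h)) / abs(h)


# The remainder vanishes identically, so the defining limit is 0.
print([remainder_ratio(h) for h in (1.0, -0.5, 1e-3, -1e-6)])

# Positive homogeneity holds, but linearity fails: D(-1) is not -D(1).
print(bouligand_derivative_at_zero(-1.0), -bouligand_derivative_at_zero(1.0))
```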

The Bouligand influence function (BIF) of the map
$T:\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})\to H$ for a distribution P
in the direction of a distribution $Q\ne P$ was defined by Christmann and
Van Messem (2008) as

$$ \lim_{\varepsilon\downarrow 0} \frac{\|T((1-\varepsilon)P+\varepsilon Q)-T(P)-\mathrm{BIF}(Q;T,P)\|_H}{\varepsilon} = 0. \tag{16} $$

Note that the BIF is a special Bouligand derivative

$$ \lim_{\|\varepsilon(Q-P)\|\to 0} \frac{\|T(P+\varepsilon(Q-P))-T(P)-\mathrm{BIF}(Q;T,P)\|_H}{\|\varepsilon(Q-P)\|} = 0 $$




due to the fact that Q and P are fixed, and it is independent of the norm
on $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. The partial Bouligand
derivative with respect to the third argument of $L^*$ is denoted by
$\nabla^B_3 L^*(x,y,t)$. The BIF shares with F.R. Hampel's influence
function the interpretation that it measures the impact of an
infinitesimally small amount of contamination of the original distribution
P in the direction of Q on the quantity of interest $T(P)$. It is thus
desirable that the function $T$ has a bounded BIF. It is known that
existence of the BIF implies existence of the IF and in this case they are
equal. The next result shows that, under some conditions, the Bouligand
influence function of SVMs exists and is bounded; see Christmann et al.
(2009) for more related results.

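Theorem 6 (Bouligand influence function). Let the main assumptions be
valid, but assume that $\mathcal{X}$ is a complete separable normed linear
space¹. Let $P,Q\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$. Let $L$ be
the pinball loss function $L_\tau$ with $\tau\in(0,1)$ or let $L$ be the
ε-insensitive loss function $L_\varepsilon$ with $\varepsilon>0$. Assume
that for all $\delta>0$ there exist positive constants $\xi_P$, $\xi_Q$,
$c_P$, and $c_Q$ such that for all $t\in\mathbb{R}$ with
$|t-f_{L,P,\lambda}(x)|\le\delta\,\|k\|_\infty$ the following inequalities
hold for all $a\in[0,2\delta\,\|k\|_\infty]$ and $x\in\mathcal{X}$:

$$ P\big([t,t+a]\,\big|\,x\big) \le c_P\,a^{1+\xi_P} \quad\text{and}\quad Q\big([t,t+a]\,\big|\,x\big) \le c_Q\,a^{1+\xi_Q}. \tag{17} $$

Then the Bouligand influence function $\mathrm{BIF}(Q;T,P)$ of
$T(P):=f_{L,P,\lambda}$ exists, is bounded, and equals

$$ \frac{1}{2\lambda}\Big(\mathbb{E}_P\,\nabla^B_3 L^*\big(X,Y,f_{L,P,\lambda}(X)\big)\Phi(X) - \mathbb{E}_Q\,\nabla^B_3 L^*\big(X,Y,f_{L,P,\lambda}(X)\big)\Phi(X)\Big). \tag{18} $$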

Note that the Bouligand influence function of the SVM only depends on Q via
the second term in (18). The interpretation of the condition (17) is that
the probability that $Y$ given $x$ is in some small interval around the SVM
is essentially at most proportional to the length of the interval to some
power greater than one.

For the pinball loss function, the BIF given in (18) simplifies to

$$ \frac{1}{2\lambda}\int\Big(P\big((-\infty,f_{L,P,\lambda}(x)]\,\big|\,x\big)-\tau\Big)\Phi(x)\,P_{\mathcal{X}}(dx) \;-\; \frac{1}{2\lambda}\int\Big(Q\big((-\infty,f_{L,P,\lambda}(x)]\,\big|\,x\big)-\tau\Big)\Phi(x)\,Q_{\mathcal{X}}(dx). \tag{19} $$

The BIF of the SVM based on the pinball loss function can hence be
interpreted as the difference of the integrated differences, weighted with
$\frac{1}{2\lambda}\Phi(x)$, between the estimated quantile level and the
desired quantile level $\tau$.

¹Bouligand derivatives are only defined in normed linear spaces, e.g.
$\mathcal{X}\subset\mathbb{R}^d$ a linear subspace.

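Recall that the BIF is a special Bouligand derivative and thus positive
homogeneous in $h=\varepsilon(Q-P)$. If the BIF exists, we then immediately
obtain

$$
\begin{aligned}
f_{L,(1-\alpha\varepsilon)P+\alpha\varepsilon Q,\lambda} - f_{L,P,\lambda}
 &= T(P+\alpha h) - T(P) \\
 &= \alpha\,\mathrm{BIF}(Q;T,P) + o(\alpha h) \\
 &= \alpha\,\big(T(P+h)-T(P)+o(h)\big) + o(\alpha h) \\
 &= \alpha\,\big(f_{L,(1-\varepsilon)P+\varepsilon Q,\lambda} - f_{L,P,\lambda}\big) + o\big(\alpha\varepsilon(Q-P)\big)
\end{aligned}
\tag{20}
$$

for all $\alpha\ge 0$. This equation gives us a nice approximation of the
asymptotic bias term
$f_{L,(1-\varepsilon)P+\varepsilon Q,\lambda}-f_{L,P,\lambda}$ if we
consider the amount $\alpha\varepsilon$ of contamination instead of
$\varepsilon$.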

4 Examples 

In this section we would like to illustrate our theoretical results on SVMs
for additive models with a few finite sample examples. The goals of this
short section are twofold. First, we would like to get some preliminary
insight into how SVMs based on kernels designed for additive models work
for finite sample sizes when compared to the standard GRBF kernel defined
on the whole input space, and to get some ideas for further research on
this topic. Second, we would like to apply support vector machines based on
the additive kernels treated in this paper to a real-life data set.

4.1 Simulated example 

Let us consider the following situation of median regression. We have two
independent input variables $X_1$ and $X_2$, each with a uniform
distribution on the interval $[0,1]$, and the output variable $Y$ given
$x=(x_1,x_2)$ has a Cauchy distribution (so that not even the first moment
exists) with center $f(x_1,x_2):=f_1(x_1)+f_2(x_2)$, where
$f_1(x_1):=1+5x_1^2$ and $f_2(x_2):=\sin(5x_2)\cos(17x_2)$. Hence the true
function $f$ we would like to estimate with SVMs has an additive structure,
where the first function is a polynomial of order two and the second
function is a smooth and bounded function but not a polynomial.
Please note that here $\mathcal{X}=[0,1]^2$ is bounded whereas
$\mathcal{Y}=\mathbb{R}$ is unbounded. As $\mathcal{X}$ is bounded, even a
polynomial kernel on $\mathcal{X}$ is bounded, which is not true for
unbounded input spaces. We simulated three data sets of this type with
sample sizes $n=500$, $n=2{,}000$, and $n=10{,}000$. We compare the exact
function $f$ with three SVMs $f_{L,\mathrm{D}_n,\lambda_n}$ fitted to the
three data sets, where we use the pinball loss function with $\tau=0.5$
because we are interested in median regression.

• Nonparametric SVM. We use an SVM based on the 2-dimensional GRBF kernel
$k$ defined in (8) to fit $f$ in a totally nonparametric manner.

• Nonparametric additive SVM. We use an SVM based on the kernel
$k=k_1+k_2$ where $k_1$ and $k_2$ are 1-dimensional GRBF kernels.

• Semiparametric additive SVM. We use an SVM based on the kernel
$k=k_1+k_2$ where $k_1$ is a polynomial kernel of order 2 to fit the
function $f_1$ and $k_2$ is a 1-dimensional GRBF kernel to fit the function
$f_2$.
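The data-generating mechanism of this simulation can be sketched as
follows. Several details are assumptions on our part and are marked as such
in the code: the coefficients of $f_1$ are passed as parameters with
illustrative defaults, the Cauchy distribution is taken with scale 1, and
the seed is arbitrary. The sketch only generates data; it does not fit any
of the three SVMs.

```python
import math
import random


def f1(x1, intercept=1.0, coef=5.0):
    """Quadratic component; intercept and coef are assumed illustrative values."""
    return intercept + coef * x1 ** 2


def f2(x2):
    """Smooth, bounded, non-polynomial component sin(5*x2) * cos(17*x2)."""
    return math.sin(5.0 * x2) * math.cos(17.0 * x2)


def simulate(n, seed=0):
    """Draw n pairs ((x1, x2), y) with Cauchy noise (scale 1 is an assumption)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        # Standard Cauchy variate via the inverse distribution function.
        noise = math.tan(math.pi * (rng.random() - 0.5))
        data.append(((x1, x2), f1(x1) + f2(x2) + noise))
    return data


data = simulate(500)
residuals = sorted(y - f1(x[0]) - f2(x[1]) for x, y in data)
# The noise has median 0, so the sample median of the residuals is near 0
# even though the sample mean is unreliable for Cauchy noise.
print(len(data), residuals[len(residuals) // 2])
```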

Our interest in these examples is to check how well SVMs using kernels
designed for additive models perform in these situations. No attempt was
made to find optimal values of the regularization parameter $\lambda$ and
the kernel parameter $\gamma$ by using a grid search or cross-validation,
because we did not want to mix the quality of such optimization strategies
with the choice of the kernels. We therefore fixed $\gamma=2$ and used the
simple non-stochastic specification $\lambda_n=0.05\,n^{-0.45}$ for the
regularization parameter, which guarantees that our consistency result from
Section 3 is applicable.

From Figures 1 to 3 we can draw the following conclusions for this special
situation.

i) If the additive model is valid, all three SVMs give comparable and
reasonable results if the sample size $n$ is large enough, even for Cauchy
distributed error terms; see Figure 3. This is in good agreement with the
theoretical results derived in Section 3.

ii) If the sample size is small to moderate and if the assumed additive
model is valid, then both SVMs based on kernels especially designed for
additive models show better results than the standard 2-dimensional GRBF
kernel; see Figures 1 and 2.

iii) The difference between the nonparametric additive SVM and the
semiparametric additive SVM was somewhat surprisingly small for all three
sample sizes, although the true function had the very special structure
which favours the semiparametric additive SVM.



4.2 Example: additive model using SVMs for rent standard

Let us now consider a real-life example of the rent standard for dwellings in a 
large city in Germany. Many German cities compose so-called rent standards 
to make a decision making instrument available to tenants, landlords, renting 
advisory boards, and experts. Such rent standards can in particular be used 
for the determination of the local comparative rent, i.e. the net rent as a 
function of the dwelling size, year of construction of the house, geographical 
information etc. For the construction of a rent standard, a representative
random sample is drawn from all households, and trained interviewers use
questionnaires to collect the relevant information. Fahrmeir et al. (2007)
described such a data set consisting of n = 3,082 rent prices in Munich,
which is one of the largest cities in Germany. They fitted the following
additive model

\[ \text{price} = f_1(\text{size}) + f_2(\text{year}) + \beta_0 + \beta_1\,\text{region}_1 + \beta_2\,\text{region}_2 + \text{error}, \]

where the following variables were used:

price       net rent price per square meter in DM [1 € ≈ 1.96 DM]
size        size of the dwelling in square meters [between 20 and 160]
year        year of construction [between 1918 and 1997]
region_1    good residential area [0 = no, 1 = yes]
region_2    best residential area [0 = no, 1 = yes].



Hence $\text{region}_1$ and $\text{region}_2$ are dummy variables with respect to a standard
residential area. Fahrmeir et al. (2007) used a special spline method for
estimating the functions $f_1$ and $f_2$.

To illustrate the SVMs with additive kernels investigated in the
present paper, we used a nonparametric additive SVM for median regression.
More precisely, we used the pinball loss function with $\tau = 0.5$ and the kernel

\[ k(x, x') = \sum_{j=1}^{4} k_j(x_j, x'_j), \qquad x = (x_1, x_2, x_3, x_4) \in \mathbb{R}^4, \quad x' = (x'_1, x'_2, x'_3, x'_4) \in \mathbb{R}^4, \]

where

$k_1$ : Gaussian RBF kernel with $\gamma = 1$ for size
$k_2$ : Gaussian RBF kernel with $\gamma = 1$ for year
$k_3$ : dot kernel for $\text{region}_1$
$k_4$ : dot kernel for $\text{region}_2$.

In analogy to the simulated examples given above, the regularizing
parameter was again set to $\lambda_n = 0.05\, n^{-0.45} = 0.00135$ such that our theoretical
results are applicable.¹
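The role of $\tau$ in the pinball loss can be illustrated with a toy computation. The sketch assumes the standard form of the pinball loss, $\tau (y - t)$ for $y \ge t$ and $(1 - \tau)(t - y)$ otherwise, and uses synthetic data; `pinball_loss` and `best_constant` are hypothetical helper names, and the constant predictor stands in for the SVM only to show which quantile the loss targets.

```python
import numpy as np

def pinball_loss(y, t, tau=0.5):
    """Pinball loss: tau*(y - t) if y >= t, else (1 - tau)*(t - y)."""
    r = np.asarray(y, dtype=float) - t
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# Synthetic responses 1, 2, ..., 101 (not the rent data).
y = np.arange(1.0, 102.0)

def best_constant(tau):
    """Constant prediction among the observed values with minimal empirical risk."""
    risks = [pinball_loss(y, t, tau).mean() for t in y]
    return float(y[int(np.argmin(risks))])

median_fit = best_constant(0.5)    # targets the median
upper_fit = best_constant(0.9)     # targets the 90% quantile
```

For $\tau = 0.5$ the minimizer is the sample median, while $\tau = 0.9$ moves it to the upper tail, which is the mechanism behind switching from median regression to 90% quantile regression with the same kernel.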

The left plot in Figure [4] shows the estimated median net rent price of
one square meter depending on the size of the dwelling and the year of
construction for a dwelling in a standard residential area. The plot shows
that the median of the net rent prices per square meter is fairly stable for
construction years up to 1960, but a more or less linear increase is visible
for newer buildings. The plot also shows that the median of the net rent
prices per square meter is especially high for dwellings of size less than 80
square meters, that the price is nearly constant for sizes between 80 and 140
square meters, and that a slight increase of the square meter price seems
to occur for even larger dwellings. The slope parameters were estimated
by $\hat\beta_1 = 1.38$ for good residential area ($\text{region}_1 = 1$) and $\hat\beta_2 = 3.46$ for
best residential area ($\text{region}_2 = 1$). Hence, apart from these level
shifts, we obtain the same surfaces for dwellings located in good or in best
residential areas. We would like to mention that we used this real-life example
just for illustration purposes, but nevertheless our results are in good
agreement with the more detailed statistical analysis of this data set made by
Fahrmeir et al. (2007), who used different statistical techniques.

From an applied point of view, one may also be interested in the 10%
highest net rent prices depending on the four explanatory variables. We
therefore repeated our computations using the same kernel, but instead of
$\tau = 0.50$ for median regression we used $\tau = 0.90$ to obtain estimates for
the 90% quantiles of the net rent prices depending on the four explanatory
variables. The right plot in Figure [4] shows the estimated 90% quantile net
rent prices of one square meter depending on the size of the dwelling and
the year of construction for a dwelling in a standard residential area.
The slope parameters were estimated by $\hat\beta_1 = 1.59$ for good residential area
($\text{region}_1 = 1$) and $\hat\beta_2 = 4.24$ for best residential area ($\text{region}_2 = 1$). The
shape of the surface is quite similar to the shape of the surface in the previous
plot for the estimated median net rent prices. However, the plot may give
an indication of a moderate peak for the 90% quantile net rent prices for
dwellings of size 100 square meters.

¹ Some numerical computations showed that the SVM results were fairly stable with
respect to other choices of $\lambda_n$, e.g. $4\lambda_n$, for this particular data set, and we hence only
show the results for $\lambda_n = 0.05\, n^{-0.45}$.

5 Discussion 

Support vector machines belong to the class of modern statistical machine
learning methods based on kernels. The success of SVMs is partly based on
the kernel trick, which makes SVMs usable even for abstract input spaces, on
their universal consistency, on their statistical robustness with respect to small
model violations, and on the existence of fast numerical algorithms. During
the last decade there has been considerable research on these topics. To
obtain universal consistency one needs a sufficiently large reproducing kernel
Hilbert space $H$, such that many interesting SVMs are based on Hilbert
spaces with infinite dimension. Due to the no-free-lunch theorem (Devroye,
1982), there exists in general no uniform rate of convergence of SVMs on the
set of all probability measures.

Although such a nonparametric approach is often the best choice in practice
due to the lack of prior knowledge on the unknown probability measure
P, a semiparametric approach or an additive model (Friedman and Stuetzle,
1981; Hastie and Tibshirani, 1990) can also be valuable for at least two
reasons: (i) In some applications some weak knowledge on P or on the unknown
function $f$ to be estimated, say the conditional quantile curve, is available, e.g.
$f$ is known to be bounded or at least integrable. (ii) For practical reasons,
we may be interested only in functions $f$ which offer a nice interpretation
although there might be a measurable function with a smaller risk, because
an interpretable prediction function can be crucial in some applications. An
important class of statistical models whose predictions are relatively easy
to interpret are additive models.

Therefore, support vector machines for additive models were treated in 
this paper and some results on their consistency and statistical robustness 
properties were derived. 

Some simple numerical examples showed that SVMs based on kernels es- 
pecially designed for an additive model can yield better predictions than the 
standard SVM based on the classical Gaussian RBF kernel, if an additive 
model is indeed valid. 
It may be worthwhile to compare the rates of convergence of SVMs
based on kernels designed for additive models with those of SVMs based on
standard kernels, because our simple numerical examples seem to indicate that
there might be some gain with respect to the rate of convergence. However,
this is beyond the scope of this paper.

We would like to mention the well-known fact that not only the sum of
s kernels is a kernel but also the product of s kernels is a kernel. Hence
it seems to be possible to derive results similar to those given here for
additive models also for multiplicative models.
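Both closure properties can be checked numerically on Gram matrices. The following sketch assumes 1-dimensional Gaussian RBF kernels of the form $\exp(-\gamma (u - u')^2)$ and random inputs; it only illustrates the well-known fact and proves nothing. The helper name `grbf_gram_1d` is hypothetical.

```python
import numpy as np

def grbf_gram_1d(u, gamma=1.0):
    """Gram matrix of the 1-dimensional Gaussian RBF kernel (assumed form)."""
    u = np.asarray(u, dtype=float)
    return np.exp(-gamma * (u[:, None] - u[None, :]) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=(40, 2))   # illustrative inputs

K1 = grbf_gram_1d(x[:, 0])
K2 = grbf_gram_1d(x[:, 1])
K_sum = K1 + K2     # Gram matrix of the additive kernel k_1 + k_2
K_prod = K1 * K2    # Gram matrix of the product kernel k_1 * k_2 (elementwise)

# Smallest eigenvalues; both should be nonnegative up to rounding error.
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()

# For Gaussian RBF factors the product is the 2-dimensional Gaussian RBF kernel.
sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
K_2d = np.exp(-1.0 * sq_dist)
```

In this particular case the product kernel coincides with the standard 2-dimensional Gaussian RBF kernel, whereas the sum yields the additive kernel studied in this paper.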

Finally, we would like to mention that there are of course many other
statistical estimation techniques for additive models, e.g. splines and boosting,
but a comparison of these methods with SVMs based on additive kernels is
beyond the scope of this paper.

Appendix: Proofs 

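Proof of Theorem [?]. First fix any $j \in \{1, \dots, s\}$ and define the mapping
$\tilde k_j : X \times X \to \mathbb{R}$ via

\[ \tilde k_j\big((x_1, \dots, x_s), (x'_1, \dots, x'_s)\big) = k_j(x_j, x'_j) \]

for every $(x_1, \dots, x_s) \in X$ and $(x'_1, \dots, x'_s) \in X$. Accordingly, for every
$f_j \in H_j$, define $\tilde f_j : X \to \mathbb{R}$ via

\[ \tilde f_j(x_1, \dots, x_s) = f_j(x_j) \qquad \forall\, (x_1, \dots, x_s) \in X. \]

Then, it is easy to see that

\[ \tilde H_j = \{ \tilde f_j : X \to \mathbb{R} \,:\, f_j \in H_j \} \]

is a Hilbert space with inner product and norm given by

\[ \langle \tilde f_j, \tilde h_j \rangle_{\tilde H_j} = \langle f_j, h_j \rangle_{H_j} \quad\text{and}\quad \|\tilde f_j\|_{\tilde H_j} = \|f_j\|_{H_j} \tag{1} \]

for every $f_j \in H_j$ and $h_j \in H_j$. Hence, for every $x = (x_1, \dots, x_s) \in X$, we
get $\tilde k_j(\cdot, x) \in \tilde H_j$ and

\[ \tilde f_j(x) = f_j(x_j) = \langle f_j, k_j(\cdot, x_j) \rangle_{H_j} = \langle \tilde f_j, \tilde k_j(\cdot, x) \rangle_{\tilde H_j} \qquad \forall\, f_j \in H_j, \]

where the last equality follows from (1) and the definition of $\tilde k_j$. That is, $\tilde k_j$
is a reproducing kernel and $\tilde H_j$ is its RKHS.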



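Next, it follows from (Berlinet and Thomas-Agnan, 2004, §4.1) that
$k = \tilde k_1 + \dots + \tilde k_s$ is a reproducing kernel on $X$ with RKHS
$H = \tilde H_1 + \dots + \tilde H_s$ and norm

\[ \|f\|_H^2 = \min_{\substack{f = \tilde f_1 + \dots + \tilde f_s \\ \tilde f_1 \in \tilde H_1, \dots, \tilde f_s \in \tilde H_s}} \Big( \|\tilde f_1\|_{\tilde H_1}^2 + \dots + \|\tilde f_s\|_{\tilde H_s}^2 \Big) \overset{(1)}{=} \min_{\substack{f = \tilde f_1 + \dots + \tilde f_s \\ f_1 \in H_1, \dots, f_s \in H_s}} \Big( \|f_1\|_{H_1}^2 + \dots + \|f_s\|_{H_s}^2 \Big). \tag{2} \]

Using the reduced notation $f_1 + \dots + f_s$ instead of $\tilde f_1 + \dots + \tilde f_s$,
inequality ([?]) follows. □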



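In order to prove Theorem [3], the following proposition is needed. It
provides conditions on $H_j$ and $\mathcal{F}_j$ under which the minimal risk over
$H = H_1 + \dots + H_s$ is equal to the minimal risk over the larger
$\mathcal{F} = \mathcal{F}_1 + \dots + \mathcal{F}_s$.

Proposition 7. Let the main assumptions (p. [?]) be valid. Let
$P \in \mathcal{M}_1(X \times Y)$ be such that

\[ H_j \subset \mathcal{F}_j \subset L_1(P_{X_j}) \qquad \forall\, j \in \{1, \dots, s\} \]

and $H_j$ is dense in $\mathcal{F}_j$ with respect to $\|\cdot\|_{L_1(P_{X_j})}$. Then,

\[ \mathcal{R}^*_{L^*,P,H} := \inf_{f \in H} \mathcal{R}_{L^*,P}(f) = \mathcal{R}^*_{L^*,P,\mathcal{F}}. \tag{3} \]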

Proof of Proposition [7]. According to the definitions, it only remains to
prove $\mathcal{R}^*_{L^*,P,H} \le \mathcal{R}^*_{L^*,P,\mathcal{F}}$. To this end, take any $f \in \mathcal{F}$ and any $\varepsilon > 0$.
Then, by assumption there are functions

\[ f_j \in \mathcal{F}_j, \qquad j \in \{1, \dots, s\}, \]

such that $f = f_1 + \dots + f_s$ and, for every $j \in \{1, \dots, s\}$, there is an $h_j \in H_j$
such that

\[ \|h_j - f_j\|_{L_1(P_{X_j})} < \frac{\varepsilon}{s \cdot |L|_1}. \tag{4} \]

Hence, for $h = h_1 + \dots + h_s \in H$,

\[ \big| \mathcal{R}_{L^*,P}(h) - \mathcal{R}_{L^*,P}(f) \big| \le \int \big| L(x, y, h(x)) - L(x, y, f(x)) \big| \, P(d(x,y)) \le |L|_1 \int |h(x) - f(x)| \, P_X(dx) \le |L|_1 \sum_{j=1}^{s} \int |h_j(x_j) - f_j(x_j)| \, P_{X_j}(dx_j) \overset{(4)}{<} \varepsilon. \]

□



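Proof of Theorem [3]. To avoid handling too many constants, let us assume
$\|k\|_\infty = 1$. According to ([?]), this implies $\|f\|_\infty \le \|f\|_H$ for all $f \in H$. Now
we use the Lipschitz continuity of $L$ to obtain, for all $g \in H$,

\[ \big| \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) - \mathcal{R}_{L^*,P}(g) \big| \le \int \big| L(x, y, f_{L,P,\lambda_n}(x)) - L(x, y, g(x)) \big| \, P(d(x,y)) \le |L|_1 \int \big| f_{L,P,\lambda_n}(x) - g(x) \big| \, P_X(dx) \le |L|_1 \int \| f_{L,P,\lambda_n} - g \|_\infty \, P_X(dx) \le |L|_1 \, \| f_{L,P,\lambda_n} - g \|_H. \tag{5} \]

Let $\Phi$ denote the canonical feature map which corresponds to the kernel $k$.
According to Christmann et al. (2009, Theorem 7), for every $n \in \mathbb{N}$, there is
a bounded, measurable function $h_n : X \times Y \to \mathbb{R}$ such that

\[ \|h_n\|_\infty \le |L|_1 \tag{6} \]

and, for every $Q \in \mathcal{M}_1(X \times Y)$,

\[ \| f_{L,P,\lambda_n} - f_{L,Q,\lambda_n} \|_H \le \lambda_n^{-1} \, \| E_P h_n \Phi - E_Q h_n \Phi \|_H. \tag{7} \]

Fix any $\varepsilon \in (0, 1)$ and define

\[ B_n := \Big\{ D_n \in (X \times Y)^n : \| E_P h_n \Phi - E_{\mathrm{D}_n} h_n \Phi \|_H \le \frac{\lambda_n \varepsilon}{|L|_1} \Big\} \tag{8} \]

where $\mathrm{D}_n$ denotes the empirical distribution of the data set $D_n$. Then, (5),
(7) and (8) yield

\[ \big| \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) - \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \big| \le \varepsilon \qquad \forall\, D_n \in B_n. \tag{9} \]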
Now let us turn to the probability $P^n(B_n)$. By use of Hoeffding's
inequality, we will show that

\[ \lim_{n \to \infty} P^n(B_n) = 1. \tag{10} \]



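To this end, we first observe that $\lambda_n n^{1/2} \to \infty$ implies that $\lambda_n \varepsilon > n^{-1/2}$ for
all sufficiently large $n \in \mathbb{N}$. Moreover, (6) and our assumption $\|k\|_\infty = 1$
yield $\|h_n \Phi\|_\infty \le |L|_1$. Define

\[ a_n := |L|_1^{-1} \varepsilon \lambda_n \qquad\text{and}\qquad \xi_n := \frac{3}{8} \cdot \frac{|L|_1^{-2} \varepsilon^2 \lambda_n^2 n}{|L|_1^{-1} \varepsilon \lambda_n + 3} = \frac{3}{8} \cdot \frac{a_n^2 n}{a_n + 3} \]

and note that, for sufficiently large $n$,

\[ \frac{\sqrt{2 \xi_n}}{\sqrt{n}} + \frac{1}{\sqrt{n}} + \frac{4 \xi_n}{3n} = \frac{\sqrt{3}}{2} \cdot \frac{a_n}{\sqrt{a_n + 3}} + \frac{1}{\sqrt{n}} + \frac{1}{2} \cdot \frac{a_n^2}{a_n + 3} < \frac{a_n}{2} + \frac{1}{\sqrt{n}} + \frac{a_n}{2} \cdot \frac{1}{3} < a_n. \tag{11} \]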



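Consequently, Hoeffding's inequality in Hilbert spaces (see
Steinwart and Christmann (2008, Corollary 6.15)) yields for $B = 1$
the bound

\[ P^n(B_n) = P^n\Big( \Big\{ D \in (X \times Y)^n : \| E_P h_n \Phi - E_{\mathrm{D}} h_n \Phi \|_H \le \frac{\lambda_n \varepsilon}{|L|_1} \Big\} \Big) \ge P^n\Big( \Big\{ D \in (X \times Y)^n : \| E_P h_n \Phi - E_{\mathrm{D}} h_n \Phi \|_H < \frac{\sqrt{2 \xi_n} + 1}{\sqrt{n}} + \frac{4 \xi_n}{3n} \Big\} \Big) \ge 1 - \exp\Big( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n / |L|_1^2}{\varepsilon \lambda_n / |L|_1 + 3} \Big) = 1 - \exp\Big( - \frac{3}{8} \cdot \frac{\varepsilon^2 \lambda_n^2 n}{(\varepsilon \lambda_n + 3 |L|_1) |L|_1} \Big) \]

for all sufficiently large values of $n$. Now (10) follows from $\lambda_n \to 0$ and
$\lambda_n n^{1/2} \to \infty$. According to (9) and (10),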



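\[ \big| \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) - \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \big| \longrightarrow 0 \qquad (n \to \infty) \]

in probability. Note that

\[ \big| \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) - \mathcal{R}^*_{L^*,P,\mathcal{F}} \big| \le \big| \mathcal{R}^*_{L^*,P,H} - \mathcal{R}^*_{L^*,P,\mathcal{F}} \big| + \big| \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) - \mathcal{R}^*_{L^*,P,H} \big| \overset{(3)}{\le} \big| \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) - \mathcal{R}^*_{L^*,P,H} \big| \le \big| \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) - \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) \big| + \big| \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) - \mathcal{R}^*_{L^*,P,H} \big|. \tag{12} \]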
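As shown above, the first term in (12) converges in probability to 0.
Therefore, it only remains to prove that the second term converges to 0. To this
end, define, for every $f \in H$, the affine linear function

\[ A^*_f : \mathbb{R} \to \mathbb{R}, \qquad \lambda \mapsto \mathcal{R}_{L^*,P}(f) + \lambda \|f\|_H^2 - \mathcal{R}^*_{L^*,P,H}. \]

Then, a continuity result for the pointwise infimum of a family of affine
functions (see e.g. Steinwart and Christmann, 2008, A.6.4) yields

\[ \lim_{n \to \infty} \inf_{f \in H} A^*_f(\lambda_n) = \inf_{f \in H} A^*_f(0). \]

However, according to the definitions,

\[ \inf_{f \in H} A^*_f(\lambda_n) = \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) + \lambda_n \| f_{L,P,\lambda_n} \|_H^2 - \mathcal{R}^*_{L^*,P,H} \quad \forall\, n \in \mathbb{N} \qquad\text{and}\qquad \inf_{f \in H} A^*_f(0) = 0. \]

Hence,

\[ 0 \le \limsup_{n \to \infty} \big( \mathcal{R}_{L^*,P}(f_{L,P,\lambda_n}) - \mathcal{R}^*_{L^*,P,H} \big) \le \limsup_{n \to \infty} \Big( \inf_{f \in H} A^*_f(\lambda_n) - \inf_{f \in H} A^*_f(0) \Big) = 0. \]

□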

Proof of Theorem [4]. Since the quantile function $f^*_{\tau,P}$ attains the minimal
risk $\mathcal{R}^*_{L^*,P}$ for the pinball loss $L = L_\tau$ (Koenker, 2005, §1.3), the assumption
$f^*_{\tau,P} \in \mathcal{F}$ implies $\mathcal{R}^*_{L^*,P,\mathcal{F}} = \mathcal{R}^*_{L^*,P}$. Hence, an application of Theorem [3] yields

\[ \mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n}) \longrightarrow \mathcal{R}^*_{L^*,P} \qquad (n \to \infty) \tag{13} \]

in probability. It is shown in Christmann et al. (2009, Corollary 31) that,
for all sequences $(f_n)_{n \in \mathbb{N}}$ of measurable functions $f_n : X \to \mathbb{R}$,

\[ \mathcal{R}_{L^*,P}(f_n) \longrightarrow \mathcal{R}^*_{L^*,P} \quad\text{implies}\quad d_0(f_n, f^*_{\tau,P}) \longrightarrow 0. \]

This proves Theorem [4] in the following way: According to the
characterization of convergence in probability by means of almost surely convergent
subsequences (Dudley, 2002, Theorem 9.2.1), it follows from (13) that, for
every subsequence of $\mathcal{R}_{L^*,P}(f_{L,D_n,\lambda_n})$, $n \in \mathbb{N}$, there is a further subsequence
which converges almost surely to $\mathcal{R}^*_{L^*,P}$. Hence, according to the cited result
(Christmann et al., 2009, Corollary 31), for every subsequence of

\[ d_0(f_{L,D_n,\lambda_n}, f^*_{\tau,P}), \qquad n \in \mathbb{N}, \]

there is a further subsequence which converges almost surely to 0. That is,
$d_0(f_{L,D_n,\lambda_n}, f^*_{\tau,P}) \to 0$ in probability. □
Figure 1: Quantile regression using SVMs and pinball loss function with
$\tau = 0.5$. Model: $Y \,|\, (x_1, x_2) \sim f_1(x_1) + f_2(x_2) + \text{Cauchy errors}$, where
$f_1(x_1) := 7 + 5 x_1^2$ and $f_2(x_2) := \sin(5 x_2) \cos(17 x_2)$, and $x_1$ and $x_2$ are
observations of independent and identically uniformly distributed random variables
on the interval [0,1]. The regularization parameter is $\lambda_n = 0.05\, n^{-0.45}$, and
the kernel parameter of the Gaussian RBF kernel is $\gamma = 2$. Upper left
subplot: true function $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. Upper right subplot:
SVM fit based on the GRBF kernel $k$ on $X = \mathbb{R}^2$. Lower left subplot: SVM fit
based on the sum of two 1-dimensional GRBF kernels. Lower right subplot:
SVM fit based on the sum of a 1-dimensional polynomial kernel on $\mathbb{R}$ and a
1-dimensional GRBF kernel.
Figure 2: Quantile regression using SVMs and pinball loss function with
$\tau = 0.5$. Model: $Y \,|\, (x_1, x_2) \sim f_1(x_1) + f_2(x_2) + \text{Cauchy errors}$, where
$f_1(x_1) := 7 + 5 x_1^2$ and $f_2(x_2) := \sin(5 x_2) \cos(17 x_2)$, and $x_1$ and $x_2$ are
observations of independent and identically uniformly distributed random variables
on the interval [0,1]. The regularization parameter is $\lambda_n = 0.05\, n^{-0.45}$, and
the kernel parameter of the Gaussian RBF kernel is $\gamma = 2$. Upper left
subplot: true function $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. Upper right subplot:
SVM fit based on the GRBF kernel $k$ on $X = \mathbb{R}^2$. Lower left subplot: SVM fit
based on the sum of two 1-dimensional GRBF kernels. Lower right subplot:
SVM fit based on the sum of a 1-dimensional polynomial kernel on $\mathbb{R}$ and a
1-dimensional GRBF kernel.
Figure 3: Quantile regression using SVMs and pinball loss function with
$\tau = 0.5$. Model: $Y \,|\, (x_1, x_2) \sim f_1(x_1) + f_2(x_2) + \text{Cauchy errors}$, where
$f_1(x_1) := 7 + 5 x_1^2$ and $f_2(x_2) := \sin(5 x_2) \cos(17 x_2)$, and $x_1$ and $x_2$ are
observations of independent and identically uniformly distributed random variables
on the interval [0,1]. The regularization parameter is $\lambda_n = 0.05\, n^{-0.45}$, and
the kernel parameter of the Gaussian RBF kernel is $\gamma = 2$. Upper left
subplot: true function $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$. Upper right subplot:
SVM fit based on the GRBF kernel $k$ on $X = \mathbb{R}^2$. Lower left subplot: SVM fit
based on the sum of two 1-dimensional GRBF kernels. Lower right subplot:
SVM fit based on the sum of a 1-dimensional polynomial kernel on $\mathbb{R}$ and a
1-dimensional GRBF kernel.



[Figure panels for n = 500: true function; nonparametric SVM; nonparametric additive SVM; semiparametric additive SVM.]



Figure 4: Plot for the fitted additive model for the rent standard data set
based on a nonparametric additive SVM for quantile regression, i.e., pinball
loss function with $\tau = 0.50$ (left) and $\tau = 0.90$ (right). The surface gives the
estimated median (left) or 90% quantile (right) net rent price of one square
meter depending on the size of the dwelling and the year of construction
for a standard residential area, i.e., $\text{region}_1 = \text{region}_2 = 0$.